Visualization and Prediction of 2020 Debates

Workflow

Part I: Visualization

  1. Build a Heat Map of the two candidates and the mediator.
  1. Make a Word Cloud from the text of the two debates for each candidate.
  1. Perform Sentiment Analysis (polarity and subjectivity) on the two candidates.

Part II: Prediction

  1. Naive Bayes
  1. Logistic Regression
  1. Support Vector Classification
  1. Hyper-parameter Tuning & Model Evaluation
  1. Model Comparison

Part I:

1. Import Libraries:

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# For analyzing text
import en_core_web_sm
nlp = en_core_web_sm.load()
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import string
from textblob import TextBlob

# For vis
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
from PIL import Image
import seaborn as sns
import datetime

import warnings
warnings.filterwarnings('ignore')

2. Clean Data:

  • Load the data; replace the null values; rename the debaters to 'Donald Trump' and 'Joe Biden' and the mediators to 'mediator_1' and 'mediator_2'.
  • In the raw data, the first presidential debate has 2 time periods and the second presidential debate has 3 time periods. To build a heat map, we need to make the timeline consistent.
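The stitching idea can be sketched independently of the notebook's hard-coded row indices. Below is a minimal, self-contained version on toy timestamps (not the real transcript): whenever the per-segment clock jumps backwards, the previous running total is carried forward as an offset.

```python
def to_seconds(tm):
    """Convert 'mm:ss' (or 'h:mm:ss') into total seconds."""
    seconds = 0
    for part in tm.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

def stitch(timestamps):
    """Make per-segment clocks cumulative: when the clock resets
    (time goes backwards), add the previous total as an offset."""
    out, offset, prev = [], 0, 0
    for tm in timestamps:
        s = to_seconds(tm) + offset
        if s < prev:              # clock reset -> a new segment started
            offset = prev
            s = to_seconds(tm) + offset
        out.append(s)
        prev = s
    return out

# toy example: the clock restarts at '00:15'
print(stitch(['01:20', '24:25', '00:15', '01:00']))
# [80, 1465, 1480, 1525]
```

This heuristic assumes a backwards jump always means a new segment; the notebook instead pins the segment boundaries at known row indices, which is more robust for this particular transcript.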
In [2]:
first = pd.read_csv('us_election_2020_1st_presidential_debate.csv')
second = pd.read_csv('us_election_2020_2nd_presidential_debate.csv')
In [3]:
first.head()
Out[3]:
speaker minute text
0 Chris Wallace 01:20 Good evening from the Health Education Campus ...
1 Chris Wallace 02:10 This debate is being conducted under health an...
2 Vice President Joe Biden 02:49 How you doing, man?
3 President Donald J. Trump 02:51 How are you doing?
4 Vice President Joe Biden 02:51 I’m well.
In [4]:
second.head()
Out[4]:
speaker minute text
0 Kristen Welker 00:18 Good evening, everyone. Good evening. Thank yo...
1 Donald Trump 07:37 How are you doing? How are you?
2 Kristen Welker 07:58 And I do want to say a very good evening to bo...
3 Kristen Welker 08:27 The goal is for you to hear each other and for...
4 Kristen Welker 09:03 … during this next stage of the coronavirus cr...
In [5]:
# summarize the null in the data
null_df = pd.DataFrame(pd.concat([first.isnull().sum(), second.isnull().sum()], axis = 1))
null_df.columns = ['first', 'second']
null_df
Out[5]:
first second
speaker 0 0
minute 1 0
text 0 0
In [6]:
first.iloc[178:181,:]
Out[6]:
speaker minute text
178 President Donald J. Trump 24:25 You don’t trust Johnson & Johnson, Pfizer?
179 Chris Wallace: NaN Okay, gentlemen, gentlemen. Let me move on to ...
180 President Donald J. Trump 00:15 Well, I’ve spoken to the companies and we can ...
In [7]:
first.loc[first.minute.isnull(), 'minute'] = '00:00'
first.iloc[178:181,:]
Out[7]:
speaker minute text
178 President Donald J. Trump 24:25 You don’t trust Johnson & Johnson, Pfizer?
179 Chris Wallace: 00:00 Okay, gentlemen, gentlemen. Let me move on to ...
180 President Donald J. Trump 00:15 Well, I’ve spoken to the companies and we can ...
In [8]:
# standardize the speaker names for consistency across the two datasets

print('names in the first dataset:', (first.speaker.unique()))
print('names in the second dataset:', (second.speaker.unique()))

first.loc[first.speaker.str.contains('Chris Wallace:'), 'speaker'] = 'Chris Wallace' # correcting the typo in the name
first.loc[first.speaker.str.contains('Vice President Joe Biden'), 'speaker'] = 'Joe Biden'
first.loc[first.speaker.str.contains('President Donald J. Trump'), 'speaker'] = 'Donald Trump'
first.loc[first.speaker.str.contains('Chris Wallace'), 'speaker'] = 'mediator_1'
second.loc[second.speaker.str.contains('Kristen Welker'), 'speaker'] = 'mediator_2'

print('Modified names in the first dataset:', (first.speaker.unique()))
print('Modified names in the second dataset:', (second.speaker.unique()))
names in the first dataset: ['Chris Wallace' 'Vice President Joe Biden' 'President Donald J. Trump'
 'Chris Wallace:']
names in the second dataset: ['Kristen Welker' 'Donald Trump' 'Joe Biden']
Modified names in the first dataset: ['mediator_1' 'Joe Biden' 'Donald Trump']
Modified names in the second dataset: ['mediator_2' 'Donald Trump' 'Joe Biden']
In [9]:
# making the time consecutive

# First Debate
first['seconds'] = 80 # initial value of 80 because the first sentence is at 1:20

# enumerate(first.minute, 1): start the counter at 1 (https://book.pythontips.com/en/latest/enumerate.html)
for i, tm in enumerate(first.minute, 1):
    timeParts = [int(s) for s in str(tm).split(':')]
    if i <= 179:
        first.loc[i-1, 'seconds'] = timeParts[0] * 60 + timeParts[1]
    elif i <= 724:
        first.loc[i-1, 'seconds'] = timeParts[0] * 60 + timeParts[1] + first.loc[178, 'seconds']
    else:  # the last segment is in h:mm:ss format
        first.loc[i-1, 'seconds'] = (timeParts[0] * 60 + timeParts[1]) * 60 + timeParts[2] + first.loc[178, 'seconds']


# Second Debate
second['seconds'] = 18 # initial value of 18 because the first sentence is at 0:18

for i, tm in enumerate(second.minute, 1):
    timeParts = [int(s) for s in str(tm).split(':')]
    if i <= 89:
        second.loc[i-1, 'seconds'] = timeParts[0] * 60 + timeParts[1]
    elif i <= 337:
        second.loc[i-1, 'seconds'] = timeParts[0] * 60 + timeParts[1] + second.loc[88, 'seconds']
    else:
        second.loc[i-1, 'seconds'] = timeParts[0] * 60 + timeParts[1] + second.loc[336, 'seconds']

first['minutes'] = first.seconds.apply(lambda x:x//60)
second['minutes'] = second.seconds.apply(lambda x:x//60)

# format the time as h:mm:ss
first['time'] = first.seconds.apply(lambda x:str(datetime.timedelta(seconds=x)))
second['time'] = second.seconds.apply(lambda x:str(datetime.timedelta(seconds=x)))
In [10]:
# columns 'seconds', 'minutes', and 'time' all record when each speaker begins to talk
first.head()
Out[10]:
speaker minute text seconds minutes time
0 mediator_1 01:20 Good evening from the Health Education Campus ... 80 1 0:01:20
1 mediator_1 02:10 This debate is being conducted under health an... 130 2 0:02:10
2 Joe Biden 02:49 How you doing, man? 169 2 0:02:49
3 Donald Trump 02:51 How are you doing? 171 2 0:02:51
4 Joe Biden 02:51 I’m well. 171 2 0:02:51
In [11]:
second.head()
Out[11]:
speaker minute text seconds minutes time
0 mediator_2 00:18 Good evening, everyone. Good evening. Thank yo... 18 0 0:00:18
1 Donald Trump 07:37 How are you doing? How are you? 457 7 0:07:37
2 mediator_2 07:58 And I do want to say a very good evening to bo... 478 7 0:07:58
3 mediator_2 08:27 The goal is for you to hear each other and for... 507 8 0:08:27
4 mediator_2 09:03 … during this next stage of the coronavirus cr... 543 9 0:09:03

3. Heat Map:

  • The lightness of the color shows how many times the candidates interrupted each other's speech (the darker the color, the more times each one started talking, or even interrupted the other).
  • The heat map gives us a clear expression of when the debate reached its climax.
In [12]:
heat = first.groupby(['minutes', 'speaker']).count().reset_index()
fig = go.Figure(data=go.Heatmap(
                z=heat.minute,
                x=heat.minutes,
                y=heat.speaker,
                colorscale='sunset', #https://plotly.com/python/builtin-colorscales/
                colorbar=dict(
                title="Heat of the discussion",
                titleside="top",
                tickmode="array",
                tickvals=[1, 4, 10],
                ticktext=["Cool", "Normal", "Hot"],
                ticks="outside"
    )
        ))

fig.update_layout(title='First Debate: # of times each one talks in each minute',
                 xaxis_nticks=36)


fig.show()

In [13]:
heat2 = second.groupby(['minutes', 'speaker']).count().reset_index()
fig2 = go.Figure(data=go.Heatmap(
        z=heat2.minute,
        x=heat2.minutes,
        y=heat2.speaker,
        colorscale='sunset',
        colorbar=dict(
        title="Heat of the discussion",
        titleside="top",
        tickmode="array",
        tickvals=[2, 5, 8],
        ticktext=["Cool", "Normal", "Hot"],
        ticks="outside"
    )
        ))

fig2.update_layout(title='Second Debate: # of times each one talks in each minute',
                 xaxis_nticks=36)

fig2.show()


4. Word Cloud:

  • Combine the text of two debates for each candidate.
  • Clean the text and plot the word cloud.
  • Remove some meaningless words from the plotted word cloud based on my own judgment and replot the word cloud using masks of the two candidates' pictures.
  • To get a better visualization, you can also load the cleaned text into WordArt.com.
In [14]:
Biden1 = first[first.speaker=='Joe Biden']
Biden2 = second[second.speaker=='Joe Biden']
Trump1 = first[first.speaker=='Donald Trump']
Trump2 = second[second.speaker=='Donald Trump']
# combine 2 debates
Biden_text = pd.concat([Biden1.text, Biden2.text], axis = 0) 
Trump_text = pd.concat([Trump1.text, Trump2.text], axis = 0)
# change list to str
Biden_text = " ".join(txt for txt in Biden_text)
Trump_text = " ".join(txt for txt in Trump_text)
In [15]:
def textclean(text):
    # tokenization
    words=word_tokenize(text)
    # lower the word
    words_lower=[w.lower() for w in words]
    # remove punctuation
    table=str.maketrans('','',string.punctuation)
    strpp=[w.translate(table) for w in words_lower]
    Words_lower=[word for word in strpp if word.isalpha()]
    # remove stopwords
    stop_words=set(stopwords.words('english'))
    Words_lower=[w for w in Words_lower if not w in stop_words]
    # lemmatize verbs and nouns (https://blog.csdn.net/weixin_33963594/article/details/88726982?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.control)
    wordnet_lemmatizer = WordNetLemmatizer()
    Words_lower = [wordnet_lemmatizer.lemmatize(w, 'v') for w in Words_lower]
    Words_lower = [wordnet_lemmatizer.lemmatize(w, 'n') for w in Words_lower]
    return(Words_lower)
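The `str.maketrans('', '', string.punctuation)` trick used in `textclean` deletes every ASCII punctuation character in one pass. A quick standalone check (toy tokens, no NLTK data needed):

```python
import string

table = str.maketrans('', '', string.punctuation)
tokens = ["don't", "healthcare,", "(folks)", "well."]
print([t.translate(table) for t in tokens])
# ['dont', 'healthcare', 'folks', 'well']
```

One caveat: `string.punctuation` covers only ASCII, so curly apostrophes like the one in the transcript's "I'm" (typeset as "I’m") pass through untouched; those tokens are then dropped by the `isalpha()` filter instead.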
In [16]:
# Create and generate a word cloud image:
# ' '.join(Words_lower): change to str
wordcloud = WordCloud(width = 600, height = 400, background_color="white").generate(' '.join(textclean(Biden_text)))

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [17]:
# Create and generate a word cloud image:
# ' '.join(Words_lower): change to str
wordcloud = WordCloud(width = 600, height = 400, background_color="white").generate(' '.join(textclean(Trump_text)))

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [18]:
# remove some meaningless words from the plotted word cloud
customized_remove_string = ['see', 'go', 'want', 'know', 'think', 'way', 'make', 'take', 'thing', 'say', 'let']
textclean_Trump_text =[w for w in textclean(Trump_text) if not w in customized_remove_string]
textclean_Biden_text =[w for w in textclean(Biden_text) if not w in customized_remove_string]
In [19]:
# replot the word cloud using masks of two candidates' picture

mask = np.array(Image.open("Trump.png"))
wordcloud = WordCloud(background_color="white", mode="RGBA", max_words=1000, mask=mask).generate(' '.join(textclean_Trump_text))

# create coloring from image
image_colors = ImageColorGenerator(mask)
plt.figure(figsize=[7,7])
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")

# store to file
# plt.savefig("Trump_wordcloud.png", format="png", dpi=200)

plt.show()
In [20]:
mask = np.array(Image.open("Biden.png"))
wordcloud = WordCloud(background_color="white", mode="RGBA", max_words=1000, mask=mask).generate(' '.join(textclean_Biden_text))

# create coloring from image
image_colors = ImageColorGenerator(mask)
plt.figure(figsize=[7,7])
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")

# store to file
# plt.savefig("Biden_wordcloud.png", format="png", dpi=200)

plt.show()

5. Sentiment Analysis:

  • Get the number of sentences used by each person, each time; create a new dataframe containing all sentences.
  • Compute polarity & subjectivity of each sentence.

    polarity: negative vs. positive (-1.0 => +1.0)

    subjectivity: objective vs. subjective (+0.0 => +1.0)

  • Group the polarity values into five buckets:

    negative [-1.0, -0.6]

    somewhat negative (-0.6, -0.2]

    neutral (-0.2, 0.2]

    somewhat positive (0.2, 0.6]

    positive (0.6, 1.0]

  • Plot the pie chart showing the percentage of different groups of sentences each candidate uses.
  • Sort out the negative and positive sentences.
  • Plot the histogram showing the distribution of each candidate's subjectivity.
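The five polarity buckets above can be expressed compactly with `bisect` instead of the chained ternary used in the code below; this sketch uses the same boundaries, with each edge belonging to the bucket on its left:

```python
from bisect import bisect_left

EDGES = [-0.6, -0.2, 0.2, 0.6]
LABELS = ['negative', 'somewhat negative', 'neutral',
          'somewhat positive', 'positive']

def sentiment_bucket(polarity):
    """Map a TextBlob polarity in [-1, 1] to one of five labels."""
    return LABELS[bisect_left(EDGES, polarity)]

print(sentiment_bucket(-0.8))  # negative
print(sentiment_bucket(0.0))   # neutral
print(sentiment_bucket(0.7))   # positive
```

Using `bisect_left` makes each boundary value (e.g. exactly -0.6 or 0.2) fall into the lower bucket, matching the strict `>` comparisons in the notebook's one-liner.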
In [21]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# number of sentences used by each person, each time they're allowed to talk
first['number_of_sents'] = first.text.apply(lambda x:len(sent_detector.tokenize(x)))
second['number_of_sents'] = second.text.apply(lambda x:len(sent_detector.tokenize(x)))
In [22]:
# number of sentences in each cell
lens_1 = first.number_of_sents
lens_2 = second.number_of_sents

# making a long list of all sentences
list_1 =[]
list_2 = []
for x in first.text.apply(lambda x:sent_detector.tokenize(x)):
    list_1.extend(x)
for x in second.text.apply(lambda x:sent_detector.tokenize(x)):
    list_2.extend(x)

# create new dataframes, repeating as appropriate
first_sent = pd.DataFrame({'speaker': np.repeat(first.speaker, lens_1),
                            'time': np.repeat(first.time, lens_1),
                            'sent': list_1})
second_sent = pd.DataFrame({'speaker': np.repeat(second.speaker, lens_2),
                            'time': np.repeat(second.time, lens_2),
                            'sent': list_2})
first_sent.head()
Out[22]:
speaker time sent
0 mediator_1 0:01:20 Good evening from the Health Education Campus ...
0 mediator_1 0:01:20 I’m Chris Wallace of Fox News and I welcome yo...
0 mediator_1 0:01:20 This debate is sponsored by the Commission on ...
0 mediator_1 0:01:20 The Commission has designed the format, six ro...
0 mediator_1 0:01:20 Both campaigns have agreed to these rules.
In [23]:
# first df
  # compute polarity & subjectivity
first_sent['polarity'] = first_sent.sent.apply(lambda x: TextBlob(x).polarity)
first_sent['subjectivity'] = first_sent.sent.apply(lambda x: TextBlob(x).subjectivity)
  # group the polarity
first_sent['sentiment'] = first_sent.polarity.apply(lambda x: 'positive' if x>0.6 else 'somewhat positive' if x>0.2 else 'neutral' if x>-0.2 else 'somewhat negative' if x>-0.6  else 'negative')

# second df
  # compute polarity & subjectivity
second_sent['polarity'] = second_sent.sent.apply(lambda x: TextBlob(x).polarity)
second_sent['subjectivity'] = second_sent.sent.apply(lambda x: TextBlob(x).subjectivity)
  # group the polarity
second_sent['sentiment'] = second_sent.polarity.apply(lambda x: 'positive' if x>0.6 else 'somewhat positive' if x>0.2 else 'neutral' if x>-0.2 else 'somewhat negative' if x>-0.6  else 'negative')

first_sent.reset_index(drop = True, inplace = True)
second_sent.reset_index(drop = True, inplace = True)
In [24]:
# plot the pie chart showing the percentage of different groups of sentences each candidate uses
summery_sentiment_first = first_sent.groupby(['speaker', 'sentiment']).count().reset_index()
Trump_sentiment_first = summery_sentiment_first.loc[summery_sentiment_first.speaker == "Donald Trump"].polarity

labels =  'somewhat positive/negative', 'neutral', 'positive/negative'
sizes = [Trump_sentiment_first[3] + Trump_sentiment_first[4], Trump_sentiment_first[1], Trump_sentiment_first[0] + Trump_sentiment_first[2]]
colors = ['yellowgreen', 'lightskyblue', 'lightcoral']
explode = (0, 0, 0)

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Polarity analysis of Trump first debate')
plt.axis('equal')
plt.show()
In [25]:
Biden_sentiment_first = summery_sentiment_first.loc[summery_sentiment_first.speaker == "Joe Biden"].polarity

labels =  'somewhat positive/negative', 'neutral', 'positive/negative'
sizes = [Biden_sentiment_first[8] + Biden_sentiment_first[9], Biden_sentiment_first[6], Biden_sentiment_first[5] + Biden_sentiment_first[7]]
colors = ['yellowgreen', 'lightskyblue', 'lightcoral']
explode = (0, 0, 0)

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Polarity analysis of Biden first debate')
plt.axis('equal')
plt.show()
In [26]:
# sort out the negative and positive sentences

cmap = sns.diverging_palette(5, 250, as_cmap=True)

first_sent.loc[(first_sent['polarity']>=0.6) | (first_sent['polarity']<=-0.6),['speaker', 'sent', 'polarity']]\
.head(15).style.background_gradient(cmap, subset=['polarity'])
Out[26]:
speaker sent polarity
30 Donald Trump Good in every way. 0.700000
37 Donald Trump She’s going to be as good as anybody that has served on that court. 0.700000
118 Joe Biden I’m happy to talk about this. 0.800000
154 Donald Trump Because they want to give good healthcare. 0.700000
156 Joe Biden Good healthcare. 0.700000
166 Donald Trump That was the worst part of Obamacare. -1.000000
168 Donald Trump Chris, that was the worst part of Obamacare. -1.000000
184 Donald Trump I’m cutting drug prices. -0.600000
189 Donald Trump So we’re cutting healthcare. -0.600000
201 mediator_1 Sir, you’ll be happy. 0.800000
280 Donald Trump If I run it badly, they’ll probably blame him, but they’ll blame me. -0.700000
293 Donald Trump Good. 0.700000
349 Joe Biden Good luck. 0.700000
370 Joe Biden He told us what a great job Xi was doing. 0.800000
414 Donald Trump Fewer people are dying when they get sick. -0.714286
In [27]:
# plot the histogram showing the distribution of each candidate's subjectivity
both = pd.concat([first_sent, second_sent], axis = 0)
fig = go.Figure()
fig.add_trace(go.Histogram(
    x=both[both.speaker == 'Donald Trump'].subjectivity,
    name='Trump',  xbins=dict(start=-1, end=2, size=0.1),
    marker_color='red', opacity=0.75))

fig.add_trace(go.Histogram(
    x=both[both.speaker == 'Joe Biden'].subjectivity,
     name='Biden', xbins=dict(start=-1, end=2, size=0.1),
    marker_color='#3498DB', opacity=0.75))

fig.update_layout(
    title_text="Number of Sentences used by Debaters with different Subjectivities",
    yaxis_title_text='Number of Sentences', 
    xaxis_title_text='Subjectivity',
    bargap=0.1, bargroupgap=0.1)

Part II:

1. Import Libraries:

In [28]:
import sklearn
from sklearn.model_selection import train_test_split
# Encode target labels with value between 0 and n_classes-1
from sklearn.preprocessing import LabelEncoder
# AUC is in fact often preferred over accuracy for binary classification:
# https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy#:~:text=AUC%20and%20accuracy%20are%20fairly%20different%20things.&text=For%20a%20given%20choice%20of,is%20already%20measuring%20something%20else.
# https://www.quora.com/Why-is-AUC-a-better-measure-of-an-algorithms-performance-than-accuracy
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report, accuracy_score
# CountVectorizer will convert a collection of text documents to a sparse matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import GridSearchCV
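As the links in the comments above argue, accuracy and AUC measure different things: accuracy scores hard predictions at a single threshold, while AUC scores the ranking of the predicted probabilities. A tiny pure-Python illustration (toy labels and scores, using the pairwise definition of ROC AUC rather than sklearn's implementation) where two models share the same accuracy but have different AUCs:

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of correct hard predictions at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (pairwise definition of ROC AUC; ties count as half a win)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [0, 1, 0, 1]
model_a = [0.20, 0.30, 0.40, 0.90]   # misranks one positive below one negative
model_b = [0.45, 0.10, 0.40, 0.90]   # misranks one positive below both negatives

print(accuracy(y, model_a), auc(y, model_a))  # 0.75 0.75
print(accuracy(y, model_b), auc(y, model_b))  # 0.75 0.5
```

Both models make identical hard predictions at the 0.5 cutoff, so accuracy cannot separate them; AUC can, which is why it is used as the `scoring` metric in the grid searches below.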

2. Data Preparation

In [29]:
both1 = both.copy()
Trump_sent =both1.loc[both1['speaker'] == 'Donald Trump']
Biden_sent =both1.loc[both1['speaker'] == 'Joe Biden']
two_sent = pd.concat([Trump_sent, Biden_sent], axis = 0)

X = two_sent['sent']
y = two_sent['speaker']

# Trump belongs to class 0 and Biden belongs to class 1
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y_encoded,
                                                    test_size=0.2,
                                                    random_state=1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

cv = CountVectorizer()
X_train_vectorized = cv.fit_transform(X_train)
X_train = X_train_vectorized
X_test_vectorized = cv.transform(X_test)
X_test = X_test_vectorized
# CountVectorizer has already been fitted with the training data. 
# So for your test data, you just want to call transform(), not fit_transform()
# https://stackoverflow.com/questions/45804133/dimension-mismatch-error-in-countvectorizer-multinomialnb
(2194,) (549,) (2194,) (549,)
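The fit/transform asymmetry flagged in the comments above can be made concrete with a toy bag-of-words counter (a simplified stand-in for `CountVectorizer`, not its real implementation): the vocabulary is frozen on the training documents, and unseen test-set words are simply dropped, which keeps the column count identical between train and test.

```python
class ToyCountVectorizer:
    """Minimal bag-of-words counter mimicking the fit/transform split."""
    def fit(self, docs):
        # build the vocabulary from the training documents only
        vocab = sorted({w for d in docs for w in d.lower().split()})
        self.vocab_ = {w: i for i, w in enumerate(vocab)}
        return self

    def transform(self, docs):
        # count only words seen during fit; unseen words are ignored
        rows = []
        for d in docs:
            row = [0] * len(self.vocab_)
            for w in d.lower().split():
                if w in self.vocab_:
                    row[self.vocab_[w]] += 1
            rows.append(row)
        return rows

cv = ToyCountVectorizer().fit(["we will win", "we will rebuild"])
print(sorted(cv.vocab_))                    # ['rebuild', 'we', 'will', 'win']
print(cv.transform(["we will win bigly"]))  # 'bigly' is dropped -> [[0, 1, 1, 1]]
```

Calling `fit_transform` on the test set instead would rebuild the vocabulary from the test documents, producing a matrix whose columns no longer line up with the trained model (the "dimension mismatch" error in the linked Stack Overflow answer).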

3. Naive Bayes & Confusion Matrix

In [30]:
NB = MultinomialNB(alpha=0.1)
NB.fit(X_train, y_train)

# make predictions & check the accuracy score
predictions = NB.predict(X_test)
print('Accuracy score:',accuracy_score(y_test, predictions))
Accuracy score: 0.7704918032786885
In [31]:
# confusion matrix (the metrics were already imported above)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(pd.DataFrame(confusion_matrix(y_test, predictions),
             columns=['Predicted Trump', "Predicted Biden"], index=['Actual Trump', 'Actual Biden']))

print(f'\nTrue Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')

print(f'\nAccuracy: { ((tp + tn) / (tp + tn + fp + fn))}')
              Predicted Trump  Predicted Biden
Actual Trump              244               55
Actual Biden               71              179

True Positives: 179
False Positives: 55
True Negatives: 244
False Negatives: 71

Accuracy: 0.7704918032786885
In [32]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
print(confusion_matrix(y_test, predictions))
print('------------------------------------------------------')
print(classification_report(y_test, predictions))
[[244  55]
 [ 71 179]]
------------------------------------------------------
              precision    recall  f1-score   support

           0       0.77      0.82      0.79       299
           1       0.76      0.72      0.74       250

    accuracy                           0.77       549
   macro avg       0.77      0.77      0.77       549
weighted avg       0.77      0.77      0.77       549

4. GridSearchCV for Model Tuning

In [33]:
def gridSearchCV(model, params):
    """
    @param    model: sklearn estimator
    @param    params (dict): Dictionary of possible parameters
    """
    model_cv = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
    model_cv.fit(X_train, y_train)
    cv_results = pd.DataFrame(model_cv.cv_results_)[['params', 'mean_test_score']]
    tuned_param = "Tuned Parameters: {}".format(model_cv.best_params_)
    best_score = "Best score is {}".format(model_cv.best_score_)
    return cv_results,tuned_param,best_score

5. Model Evaluation

In [34]:
def evaluate(model, plotROC=False):
    """
    1. Plot ROC AUC of the test set
    2. Find and print the optimal threshold (max of Youden's J = TPR - FPR)
    """
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)
    preds = probs[:,1]
    fpr, tpr, threshold = roc_curve(y_test, preds)
    auc_score = roc_auc_score(y_test, preds)
    print(f'AUC: {auc_score:.4f}')  # rounded to 4 decimal places
    
    # Find optimal threshold
    rocDf = pd.DataFrame({'fpr': fpr, 'tpr':tpr, 'threshold':threshold})
    rocDf['tpr - fpr'] = rocDf.tpr - rocDf.fpr
    OptimalThreshold = rocDf.threshold[rocDf['tpr - fpr'].idxmax()]
    print(f'OptimalThreshold: {OptimalThreshold:.4f}')
    
    # Get accuracy over the test set
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f'Accuracy: {accuracy*100:.2f}%')
    
    # Plot ROC AUC
    if plotROC:
        plt.title('Receiver Operating Characteristic')
        plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % auc_score)
        plt.legend(loc = 'lower right')
        plt.plot([0, 1], [0, 1],'r--')
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
        plt.show()
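The optimal threshold printed by `evaluate` can replace the default 0.5 cutoff when turning probabilities into class labels; a minimal sketch with toy probabilities and a hypothetical threshold value:

```python
probs = [0.31, 0.48, 0.55, 0.61, 0.72]   # toy positive-class probabilities
threshold = 0.49                          # e.g. the Youden-J optimum from evaluate()
preds = [int(p >= threshold) for p in probs]
print(preds)  # [0, 0, 1, 1, 1]
```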

6. MultinomialNB

In [35]:
params = {'alpha': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,\
                        1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4]}
NB = MultinomialNB()
gridSearchCV(NB, params)
Out[35]:
(            params  mean_test_score
 0   {'alpha': 0.1}         0.830339
 1   {'alpha': 0.2}         0.834872
 2   {'alpha': 0.3}         0.836770
 3   {'alpha': 0.4}         0.837904
 4   {'alpha': 0.5}         0.837962
 5   {'alpha': 0.6}         0.837802
 6   {'alpha': 0.7}         0.837469
 7   {'alpha': 0.8}         0.837194
 8   {'alpha': 0.9}         0.836979
 9   {'alpha': 1.0}         0.836532
 10  {'alpha': 1.1}         0.835903
 11  {'alpha': 1.2}         0.835212
 12  {'alpha': 1.3}         0.834638
 13  {'alpha': 1.4}         0.834077
 14  {'alpha': 1.5}         0.833484
 15  {'alpha': 1.6}         0.832854
 16  {'alpha': 1.7}         0.832293
 17  {'alpha': 1.8}         0.831703
 18  {'alpha': 1.9}         0.831180
 19  {'alpha': 2.0}         0.830648
 20  {'alpha': 2.1}         0.830125
 21  {'alpha': 2.2}         0.829383
 22  {'alpha': 2.3}         0.828733
 23  {'alpha': 2.4}         0.828084,
 "Tuned Parameters: {'alpha': 0.5}",
 'Best score is 0.8379622182617972')
In [36]:
evaluate(MultinomialNB(alpha=0.5), plotROC=True)
AUC: 0.8519
OptimalThreshold: 0.5525
Accuracy: 76.50%

7. LogisticRegression

In [37]:
params = {'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}

lr = LogisticRegression()
gridSearchCV(lr,params)
Out[37]:
(         params  mean_test_score
 0  {'C': 0.001}         0.682396
 1   {'C': 0.01}         0.728972
 2    {'C': 0.1}         0.783241
 3      {'C': 1}         0.811268
 4     {'C': 10}         0.808383
 5    {'C': 100}         0.791840
 6   {'C': 1000}         0.778282,
 "Tuned Parameters: {'C': 1}",
 'Best score is 0.811267992272108')
In [38]:
evaluate(LogisticRegression(C = 1), plotROC=True)
AUC: 0.8289
OptimalThreshold: 0.4910
Accuracy: 75.41%

8. SVC

In [39]:
params = {'gamma':[0.1, 1, 10, 100], 
           'C':[0.1, 1, 10, 100, 1000]}
clf = svm.SVC()
gridSearchCV(clf,params)
Out[39]:
(                       params  mean_test_score
 0    {'C': 0.1, 'gamma': 0.1}         0.721931
 1      {'C': 0.1, 'gamma': 1}         0.691443
 2     {'C': 0.1, 'gamma': 10}         0.620016
 3    {'C': 0.1, 'gamma': 100}         0.551087
 4      {'C': 1, 'gamma': 0.1}         0.771027
 5        {'C': 1, 'gamma': 1}         0.705132
 6       {'C': 1, 'gamma': 10}         0.619962
 7      {'C': 1, 'gamma': 100}         0.551034
 8     {'C': 10, 'gamma': 0.1}         0.784236
 9       {'C': 10, 'gamma': 1}         0.706251
 10     {'C': 10, 'gamma': 10}         0.620194
 11    {'C': 10, 'gamma': 100}         0.551074
 12   {'C': 100, 'gamma': 0.1}         0.775742
 13     {'C': 100, 'gamma': 1}         0.706264
 14    {'C': 100, 'gamma': 10}         0.620194
 15   {'C': 100, 'gamma': 100}         0.551074
 16  {'C': 1000, 'gamma': 0.1}         0.775742
 17    {'C': 1000, 'gamma': 1}         0.706264
 18   {'C': 1000, 'gamma': 10}         0.620194
 19  {'C': 1000, 'gamma': 100}         0.551074,
 "Tuned Parameters: {'C': 10, 'gamma': 0.1}",
 'Best score is 0.7842362086685479')
In [40]:
evaluate(svm.SVC(C = 10, gamma = 0.1, probability=True), plotROC=True)
AUC: 0.7761
OptimalThreshold: 0.5266
Accuracy: 72.68%

9. Model Comparison

In [41]:
label = ['Naive Bayes', 'Logistic Regression', 'Support Vector Classification']
auclist = [0.8519, 0.8289, 0.7761]

# generate an array of length len(label) and use it on the x-axis
def plot_bar_x():
    # this is for plotting purpose
    index = np.arange(len(label))
    clrs = ['grey' if (x < max(auclist)) else 'red' for x in auclist ]
    g=sns.barplot(x=index, y=auclist, palette=clrs) # color=clrs)   
    plt.xlabel('Model type', fontsize=10)
    plt.ylabel('AUC score', fontsize=10)
    plt.xticks(index, label, fontsize=10, rotation=30)
    plt.title('AUC score for each fitted model')
    ax=g
    for p in ax.patches:
             ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=11, color='gray', xytext=(0, 20),
                 textcoords='offset points')
    g.set_ylim(0,1.25) #To make space for the annotations

plot_bar_x()
